cg-proc -w mangles case #24
Comments
I think this one will be hard to deal with correctly without abandoning the "lemma-casing→form-casing" mechanism (where casing is {lower,title,upper}). That is, in Apertium we currently completely drop source forms while encoding the casing of the source form on the source lemma, so Some ideas:
Thanks for the ideas! After experimenting a bit with CG, I have managed to overcome the limitation for both languages, with two solutions that are pair-independent (everything is corrected in the language's CG after-section):
The wordform «Håndball-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now titlecases first subreading instead of uppercasing:
/Håndball<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

The wordform «HÅNDBALL-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now uppercases second subreading too:
/HÅNDBALL<n><m><sg><ind><cmp><guio>+VM<n><nt><sg><ind>$

Unfortunately doesn't do anything for GrammarSoft#24

TODO: Could avoid some copying of UnicodeStrings
The wordform «Håndball-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now titlecases first subreading instead of uppercasing:
/Håndball<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$

The wordform «HÅNDBALL-VM» used to give
/HÅNDBALL<n><m><sg><ind><cmp><guio>+vm<n><nt><sg><ind>$
now uppercases second subreading too:
/HÅNDBALL<n><m><sg><ind><cmp><guio>+VM<n><nt><sg><ind>$

We require at least two alphabetic uppercase characters before calling it uppercase (and no lowercase). This doesn't solve #24 but at least realises ^I/prpers<prn>$ as ^I/Prpers<prn>$ instead of ^I/PRPERS<prn>$, which is less shouty in the cases where people haven't yet made the CG rules for BOS vs non-BOS.

Code now also uses pointer offsets to avoid unnecessary copying.

git-svn-id: svn+ssh://beta.visl.sdu.dk/usr/local/svn/repos/visl/tools/vislcg3/trunk@13496 cb2587b3-b7ff-0310-8c81-a2a651690ada
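The casing heuristic described in the commit message above can be sketched roughly as follows. This is only an illustration in Python, not the actual vislcg3 code (which is C++ and works on UnicodeStrings); the function names `detect_case` and `realise`, and the representation of subreadings as a plain list of lemma strings, are assumptions made for the example.

```python
def detect_case(form: str) -> str:
    """Classify a wordform as 'upper', 'title' or 'lower'.

    Assumed rule, following the commit message: 'upper' needs at least
    two alphabetic uppercase characters and no lowercase ones.
    """
    letters = [ch for ch in form if ch.isalpha()]
    n_upper = sum(1 for ch in letters if ch.isupper())
    has_lower = any(ch.islower() for ch in letters)
    if n_upper >= 2 and not has_lower:
        return "upper"
    if form[:1].isupper():
        return "title"
    return "lower"


def realise(form: str, lemmas: list[str]) -> list[str]:
    """Copy the casing of the wordform onto the lemmas of its subreadings.

    Assumption for this sketch: 'upper' applies to every subreading,
    'title' only to the first one, 'lower' leaves the lemmas untouched.
    """
    case = detect_case(form)
    if case == "upper":
        return [lemma.upper() for lemma in lemmas]
    if case == "title" and lemmas:
        first = lemmas[0][:1].upper() + lemmas[0][1:]
        return [first] + lemmas[1:]
    return list(lemmas)


if __name__ == "__main__":
    print(realise("Håndball-VM", ["håndball", "vm"]))
    print(realise("HÅNDBALL-VM", ["håndball", "vm"]))
    print(realise("I", ["prpers"]))
```

Under these assumptions a single-letter form such as «I» counts as titlecase rather than uppercase, which is the behaviour the commit message describes for ^I/prpers<prn>$.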
See https://groups.google.com/forum/#!topic/constraint-grammar/pLpCsu-eUY4
I think this goes to @unhammer