Greedy prefix method sometimes fails for complex words #19

Closed
ProfessorO opened this issue Aug 28, 2016 · 3 comments · Fixed by #24

Comments

@ProfessorO
Owner

ProfessorO commented Aug 28, 2016

The word "unuenaskitoj" (firstborns) should parse as unu-e-nask-it-o-j, but because we always grab the longest root that matches the front of the remaining word, we instead get unu-en-as-ki-{toj} (where "toj" is an unknown root, which luckily doesn't mean anything in Esperanto, AFAIK). It looks like we'll need a fallback: if the greedy prefix method leaves any unparseable section, try something else (say, greedy suffix, or iterating through all possible parsings, which would be far slower).
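To make the failure concrete, here is a minimal sketch of a greedy prefix parser. This is not the project's actual code: the function name `greedy_prefix` and the tiny `ROOTS` set are assumptions made up for illustration, standing in for the real root database.

```python
# Hypothetical toy root list (an assumption; the real database is much larger).
ROOTS = {"unu", "en", "e", "as", "ki", "nask", "it", "o", "j"}


def greedy_prefix(word, roots=ROOTS):
    """Repeatedly strip the longest root that matches the front of the word."""
    pieces = []
    while word:
        # Try the longest candidate prefix first, then shorter ones.
        for end in range(len(word), 0, -1):
            if word[:end] in roots:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            # No root matches the front: mark the rest as unknown and stop.
            pieces.append("{" + word + "}")
            break
    return pieces


print("-".join(greedy_prefix("unuenaskitoj")))  # unu-en-as-ki-{toj}
```

Even though e, nask, it, o and j are all in the toy list, taking "en" right after "unu" commits the parser to a path it can never recover from.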

@ProfessorO
Owner Author

Genesis 4:3, BTW. :)

@ProfessorO
Owner Author

This is related to the parsing of kia/kiam, kio/kiom, tia/tiam, tio/tiom. Right now only the shorter root of each pair is in the database. I could add "m" as a separate root, but it doesn't actually have a meaning of its own. However, with tiam in the list as its own root, tiamaniere (in Genesis 6:15), which should parse to tia-manier-e (in such a manner), instead parses as tiam-a-ni-er-e, which isn't even sort of right (it would mean at-that-time + adjective + we + part-of-the-whole + adverb).

I'm beginning to think the greedy approach is right, but it needs to pick the longest root that fits ANYWHERE in the word first, and then recursively parse whatever remains on either side of it as separate words (two of them, if the root sits in the middle of the word). Or perhaps try all possible parsings and pick the one with the fewest roots. Or something like that. :/
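A rough sketch of the "longest root anywhere first" idea, under the assumption that ties are broken by taking the leftmost match. The name `longest_root_first` and the toy `ROOTS` set are hypothetical, not taken from the repository.

```python
# Hypothetical toy root list covering the two example words (an assumption).
ROOTS = {
    "unu", "en", "e", "as", "ki", "nask", "it", "o", "j",
    "tia", "tiam", "manier", "a", "ni", "er",
}


def longest_root_first(word, roots=ROOTS):
    """Split on the longest root found anywhere, then recurse on both sides."""
    if not word:
        return []
    # Longest candidates first; among equal lengths, leftmost position first.
    for size in range(len(word), 0, -1):
        for start in range(len(word) - size + 1):
            piece = word[start:start + size]
            if piece in roots:
                left = longest_root_first(word[:start], roots)
                right = longest_root_first(word[start + size:], roots)
                return left + [piece] + right
    # Nothing in this fragment matches any root.
    return ["{" + word + "}"]


print("-".join(longest_root_first("tiamaniere")))    # tia-manier-e
print("-".join(longest_root_first("unuenaskitoj")))  # unu-e-nask-it-o-j
```

In both words the longest root (manier, nask) sits in the middle, so anchoring on it first keeps the shorter roots from grabbing its letters.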

@ProfessorO
Owner Author

Another interesting example: aliris in Genesis 19:9. It should parse as al-ir-is, but since ali is a root (meaning other), it instead parses as ali-{ris} (where ris doesn't parse). Being greedier wouldn't fix this one (ali really is the longest root that fits in the word); what would fix it is backtracking until you find at least one collection of roots that lets the parse complete.
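A minimal backtracking sketch, assuming the parser is allowed to enumerate complete parses lazily. The generator `all_parses` and the toy `ROOTS` set are hypothetical names for illustration only. It tries the longest prefix first and backs up whenever the remainder cannot be fully parsed; taking the shortest result gives the "fewest roots" variant mentioned earlier in the thread.

```python
# Hypothetical toy root list (an assumption, not the real database).
ROOTS = {"ali", "al", "ir", "is", "tia", "tiam", "manier", "a", "ni", "er", "e"}


def all_parses(word, roots=ROOTS):
    """Yield every way to split `word` entirely into known roots."""
    if not word:
        yield []
        return
    # Longest prefix first; shorter prefixes are tried when we backtrack.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in roots:
            for rest in all_parses(word[end:], roots):
                yield [head] + rest


# First complete parse: "ali" is tried first, fails on "ris", then "al" works.
first = next(all_parses("aliris"), None)
print("-".join(first))  # al-ir-is

# Fewest roots: prefers tia-manier-e over tiam-a-ni-er-e.
fewest = min(all_parses("tiamaniere"), key=len)
print("-".join(fewest))  # tia-manier-e
```

Note that stopping at the first complete parse is not enough for tiamaniere, since tiam-a-ni-er-e also completes; that case needs a ranking rule such as fewest roots. The exhaustive search is exponential in the worst case, so it probably belongs behind the fast greedy path as a fallback.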

ProfessorO added a commit that referenced this issue Sep 1, 2016
… first (LPF?). Fixes #19, but doesn't address #21 or #22, and actually creates #23.