[FR] Redirect conjugated verbs to their infinitive form #167

Closed
BoboTiG opened this issue Nov 1, 2020 · 31 comments · Fixed by #191

@BoboTiG
Owner

BoboTiG commented Nov 1, 2020

As requested, it would be nice to have conjugated verbs redirect to their infinitive form instead of returning nothing.

I already tried some things, but without success. I think we could make use of variants, but it is not clear yet how to do that.

@lasconic
Collaborator

lasconic commented Nov 6, 2020

Can you explain what was wrong with variants?

@BoboTiG
Owner Author

BoboTiG commented Nov 6, 2020

There is nothing wrong with variants, I just could not get them to work (though I have not really tried). There may also be something to do with the Trie, but I lack the time to have a serious look at it.

@lasconic
Collaborator

lasconic commented Nov 6, 2020

OK. I might have a look tonight or tomorrow.
My first idea would be to detect inflected verb forms like "mangeait" and store them, probably in another JSON file, in a dictionary: variants["manger"] = ["mangeait", "mangeais" ...].
They should indeed all go into the Trie, and the dictionary can be used to look up the variants of each infinitive.
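
For illustration, here is a minimal Python sketch of that idea. It is not the project's actual code: `build_variants` and `build_reverse_index` are hypothetical names, and the (form, infinitive) pairs are assumed to have been extracted from Wiktionary beforehand.

```python
from collections import defaultdict


def build_variants(pairs):
    """Group inflected forms under their infinitive.

    `pairs` is an iterable of (form, infinitive) tuples (assumed input).
    """
    variants = defaultdict(list)
    for form, infinitive in pairs:
        # Skip the infinitive itself and avoid duplicates.
        if form != infinitive and form not in variants[infinitive]:
            variants[infinitive].append(form)
    return dict(variants)


def build_reverse_index(variants):
    """Map each inflected form back to the infinitive(s) it may come from."""
    reverse = defaultdict(list)
    for infinitive, forms in variants.items():
        for form in forms:
            reverse[form].append(infinitive)
    return dict(reverse)


pairs = [
    ("mangeait", "manger"),
    ("mangeais", "manger"),
    ("colligeait", "colliger"),
]
variants = build_variants(pairs)
reverse = build_reverse_index(variants)
```

The reverse index keeps a list per form because a single spelling can belong to more than one verb.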

@BoboTiG
Owner Author

BoboTiG commented Nov 6, 2020

Yes, that is an idea. Maybe a second JSON file is not needed; keeping all the words in one file is OK. But do as you prefer, we can iterate on it afterwards :)

@akorx

akorx commented Nov 7, 2020

> OK. I might have a look tonight or tomorrow.
> My first idea would be to detect inflected verb forms like "mangeait" and store them, probably in another JSON file, in a dictionary: variants["manger"] = ["mangeait", "mangeais" ...].
> They should indeed all go into the Trie, and the dictionary can be used to look up the variants of each infinitive.

Listing all the forms will certainly take a long time, but it seems to be the only solution. It is really the last big step toward making the dictionary complete and fully functional.

@BoboTiG
Owner Author

BoboTiG commented Nov 7, 2020

Time is not an issue ;)
I just hope the Kobo will not "freeze" while looking for a word if there are too many of them.

@BoboTiG
Owner Author

BoboTiG commented Nov 7, 2020

@lasconic have a look at that commit for some details: 7b5c1f5

@lasconic
Collaborator

lasconic commented Nov 7, 2020

I see, but then what's the expected behavior?

  • "silicone" is a noun but also a form of the verb "siliconer" in several tenses. I guess we don't want to redirect to "siliconer", right? Do we still want to list the definitions of "silicone" as a verb inflection (imperative, indicative, etc. of "siliconer")?

  • "colligeait" is purely a verb inflection, so it should be redirected to "colliger". It must then be added to the Trie and listed as a variant of "colliger", and it should work.

  • "peut" is also purely an inflection of "pouvoir", but its prefix "pe" differs from the prefix of "pouvoir" ("po"). So it must be added to the Trie and to the pe.html file, but with the definition of "pouvoir"; ideally just the verb definition, since "pouvoir" is also a noun.
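
The prefix constraint from the last bullet can be sketched as follows. This is a deliberately simplified assumption: the real Kobo prefix rule has more special cases than "first two letters, lowercased", and `prefix` and `same_file` are hypothetical helpers, not functions from this project.

```python
def prefix(word, length=2):
    """Simplified stand-in for the Kobo prefix rule: first letters, lowercased."""
    return word.lower()[:length]


def same_file(form, infinitive):
    """True when the form and its infinitive would land in the same prefix file."""
    return prefix(form) == prefix(infinitive)


examples = [
    ("colligeait", "colliger"),
    ("pourraient", "pouvoir"),
    ("peut", "pouvoir"),
]
for form, infinitive in examples:
    kind = "same file" if same_file(form, infinitive) else "different files"
    print(f"{form} -> {infinitive}: {prefix(form)}.html / {prefix(infinitive)}.html ({kind})")
```

Under this simplification, only forms whose prefix matches their infinitive can be handled as plain variants; the others need an entry of their own in another file.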

@BoboTiG
Owner Author

BoboTiG commented Nov 7, 2020

Let's start the "easy" way and handle "pure" verbs like "pourraient" (when there is no ambiguity). It will be a great first move.

We will handle more complex words later.

Finally, you were right: using a separate JSON would help to manage those words.

WDYT?

@lasconic
Collaborator

lasconic commented Nov 7, 2020

Well, I believe it's not the easy way :) but let's see what can be done...

@lasconic
Collaborator

lasconic commented Nov 7, 2020

The easy case is when the first two letters do not change between the form and the infinitive. It should work with the PR. I tested the test dict on my Kobo, and searching for "colligeait" displays the definition of "colliger".

@lasconic
Collaborator

lasconic commented Nov 7, 2020

I built the full fr dictionary and it seems to work OK on my Kobo. The size increase is around 5 MB, and it is not noticeably slower.

@BoboTiG
Owner Author

BoboTiG commented Nov 7, 2020

That's awesome!

I will have some time tomorrow for the review, otherwise it will be Monday (sorry for the delay). The first thing I noticed, and that you should have a look at, is when a word has several inflections. Try "suis", what will be the result?

Small note: "pouvoir" and "pourraient" both start with the same letters, isn't it the easy case?

@BoboTiG
Owner Author

BoboTiG commented Nov 7, 2020

> I built the full fr dictionary and it seems to work OK on my Kobo. The size increase is around 5 MB, and it is not noticeably slower.

Out of curiosity, how many words in total?

@lasconic
Collaborator

lasconic commented Nov 7, 2020

Saved 1,530,014 words into data/fr/data.json

@lasconic
Collaborator

lasconic commented Nov 7, 2020

I added a fix for 3rd group verbs. I will test further with my current book!

@BoboTiG
Owner Author

BoboTiG commented Nov 7, 2020

> I added a fix for 3rd group verbs. I will test further with my current book!

Out of curiosity, which book? :D

@lasconic
Collaborator

lasconic commented Nov 7, 2020

Nothing fancy, but a lot of verbs ;) "L'illusion" - Maxime Chattam
More fixes for 3rd group verbs.

I also found some issues that I have not fixed yet:

  • "mirent", "peux", etc. are not supported: the first letters are not the same for the conjugated form and the infinitive. "mirent" resolves to "mire" and not "mirer", let alone "mettre"...
  • "minutes" and "courses" are now redirected to "minuter" and "courser" instead of "minute" and "course".

@BoboTiG
Owner Author

BoboTiG commented Nov 7, 2020

> "mirent", "peux", etc. are not supported: the first letters are not the same for the conjugated form and the infinitive. "mirent" resolves to "mire" and not "mirer", let alone "mettre"...

Having 100% precision will be very, very hard. If we can target 90%, this is way better than any other dictionary ;)

> "minutes" and "courses" are now redirected to "minuter" and "courser" instead of "minute" and "course".

Hm for such words we would lose the plural -> singular redirection 🤔

@lasconic
Collaborator

lasconic commented Nov 7, 2020

> "minutes" and "courses" are now redirected to "minuter" and "courser" instead of "minute" and "course".
>
> Hm for such words we would lose the plural -> singular redirection 🤔

Yes, indeed... and I don't see how to fix it without another special case: looking for a "{{S|nom|fr|flexion}}" section and ignoring the verb form in that case...
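
Such a special case could look roughly like the sketch below. It is only an assumption about how the check might be written: `redirect_kind` is a hypothetical helper, the verb section marker "{{S|verbe|fr|flexion}}" is assumed by analogy with the noun one, and the wikicode snippets are made-up excerpts for the example.

```python
import re

# Section markers for inflected forms in French Wiktionary wikicode.
# The verb marker is an assumption modeled on the noun marker quoted above.
NOUN_FLEXION = re.compile(r"\{\{S\|nom\|fr\|flexion[|}]")
VERB_FLEXION = re.compile(r"\{\{S\|verbe\|fr\|flexion[|}]")


def redirect_kind(wikicode):
    """Decide which redirection to keep for a word.

    A noun inflection wins over a verb inflection, so that a plural
    still points to its singular.
    """
    if NOUN_FLEXION.search(wikicode):
        return "noun"
    if VERB_FLEXION.search(wikicode):
        return "verb"
    return None


# Made-up excerpts, for illustration only.
both_sections = "=== {{S|nom|fr|flexion}} ===\n...\n=== {{S|verbe|fr|flexion}} ===\n..."
verb_only = "=== {{S|verbe|fr|flexion}} ===\n..."
```

With this rule, a word like "minutes" that has both sections keeps its plural -> singular redirection, while a pure verb form is still redirected to its infinitive.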

@BoboTiG
Owner Author

BoboTiG commented Nov 8, 2020

I just checked and it works pretty well 👍

Of course, this is not perfect: "sembla" or "avais" for instance. Overall, this is quite a good feature!

@lasconic
Collaborator

lasconic commented Nov 8, 2020

  • "sembla" is a bit like "minutes" or "courses": it is also a noun. I just discovered that it is possible to add several entries with the same word in the HTML file, and Kobo should display them both? It could be a solution. It would increase the number of words, but it would also fix other issues: currently, a word which is both a verb and a noun ("manger"?) has a mix of definitions.

  • "avais" is like "peux" or "mirent". It should be added as a word in the av.html file with a duplicated definition. Or, depending on your policy on firmware support, it could use the new prefix_exception file. See Dictionary handling changes in 4.24.15672 pgaskin/dictutil#14 (comment)

BoboTiG added a commit that referenced this issue Nov 8, 2020
@BoboTiG
Owner Author

BoboTiG commented Nov 8, 2020

> "sembla" is a bit like "minutes" or "courses": it is also a noun. I just discovered that it is possible to add several entries with the same word in the HTML file, and Kobo should display them both? It could be a solution. It would increase the number of words, but it would also fix other issues: currently, a word which is both a verb and a noun ("manger"?) has a mix of definitions.

Yes, Kobo will display multiple words, for example "empire" will show "empire" but also "Empire". I never tested with two identical entries, that would be interesting :) I do not see the increase as an issue though.
The definition mix is also something cool to have, no need to change that IMO.

> "avais" is like "peux" or "mirent". It should be added as a word in the av.html file with a duplicated definition. Or, depending on your policy on firmware support, it could use the new prefix_exception file. See pgaskin/dictutil#14 (comment)

We can start with the duplicated definition, WDYT?
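
A rough sketch of the duplicated-definition approach, under loud assumptions: the two-letter `prefix` rule is a simplification of the real Kobo rule, `layout` is a hypothetical function, and the nested dictionaries only stand in for the per-prefix HTML files.

```python
from collections import defaultdict


def prefix(word):
    """Simplified stand-in for the Kobo prefix rule (assumption)."""
    return word.lower()[:2]


def layout(definitions, variants):
    """Return {prefix: {headword: {"definition": ..., "variants": [...]}}}.

    Forms sharing the infinitive's prefix are attached as variants.
    Forms with another prefix get their own entry with a copy of the definition.
    """
    files = defaultdict(dict)
    for infinitive, definition in definitions.items():
        entry = {"definition": definition, "variants": []}
        files[prefix(infinitive)][infinitive] = entry
        for form in variants.get(infinitive, []):
            if prefix(form) == prefix(infinitive):
                entry["variants"].append(form)
            else:
                # Duplicate the definition, unless the form already has an entry.
                files[prefix(form)].setdefault(
                    form, {"definition": definition, "variants": []}
                )
    return dict(files)


definitions = {"pouvoir": "(definition of pouvoir)"}
variants = {"pouvoir": ["pourraient", "peux"]}
files = layout(definitions, variants)
```

Here "pourraient" stays a variant of "pouvoir" in the "po" file, while "peux" becomes its own entry in the "pe" file with the same definition.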

@akorx

akorx commented Nov 8, 2020

> "mirent", "peux", etc. are not supported: the first letters are not the same for the conjugated form and the infinitive. "mirent" resolves to "mire" and not "mirer", let alone "mettre"...
>
> Having 100% precision will be very, very hard. If we can target 90%, this is way better than any other dictionary ;)
>
> "minutes" and "courses" are now redirected to "minuter" and "courser" instead of "minute" and "course".
>
> Hm for such words we would lose the plural -> singular redirection 🤔

The only way to reach 100% success is to recover every verb in all its forms; an algorithm cannot guess that "va" refers to "aller". But where can we find this list? The website "https://leconjugueur.lefigaro.fr/" has it, so it exists somewhere...

Another, more complex solution, which would reduce the list of words to write to a file, is to do as the "Bescherelle", a conjugation dictionary, does: it does not hold the complete list of verbs in all their forms. It only knows the exceptions, like "va" which comes from "aller", and uses patterns for the rest of the verbs.
Example:
"shooter", "lancer", "grogner", "esquiver" are modeled on the verb "parler" and will therefore all follow the same pattern when conjugated.
Conclusion: we would rather have, in one place, all the conjugated forms of the models, for example:
"{parl}er" => "{parl}ais", "{parl}as", "{parl}a", etc.
And in another, a list of verbs and the models they are based on. It could be built as follows:
model: "{parl}er" => "{shoot}er", "{lanc}er", "{grogn}er", "{esquiv}er"
So if we look for "shootas", without having written it explicitly anywhere, we know that the verb "{shoot}er" exists in the form "shootas" because it is based on the model "{parl}er". We can therefore find the correct word, "shooter", and its definition in the dictionary.
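
The model-based lookup described above can be sketched like this in Python. Everything here is hypothetical and incomplete: the table names, the three endings listed for "parler", and the exception list are illustrative only. "lancer" is left out of the sketch because its stem changes in some forms ("lançais").

```python
# Hypothetical model table: infinitive ending and a few conjugated endings.
MODEL_ENDINGS = {
    "parler": ("er", ["ais", "as", "a"]),
}

# Hypothetical verb list: infinitive -> model it is conjugated like.
VERB_MODELS = {
    "parler": "parler",
    "shooter": "parler",
    "grogner": "parler",
    "esquiver": "parler",
}

# Irregular forms that no pattern can guess.
EXCEPTIONS = {"va": ["aller"]}


def build_index(verb_models, model_endings):
    """Generate every conjugated form and map it to its infinitive(s)."""
    index = {}
    for infinitive, model in verb_models.items():
        infinitive_ending, endings = model_endings[model]
        stem = infinitive[: -len(infinitive_ending)]
        for ending in endings:
            index.setdefault(stem + ending, []).append(infinitive)
    return index


def find_infinitives(form, index, exceptions):
    """Look a form up in the exceptions first, then in the generated forms."""
    return exceptions.get(form, []) + index.get(form, [])


INDEX = build_index(VERB_MODELS, MODEL_ENDINGS)
print(find_infinitives("shootas", INDEX, EXCEPTIONS))
```

The generated index is only needed while building the dictionary, so the model tables stay small even if the list of conjugated forms is large.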

Just a question: it seems to me that the original dictionary of the Kobo reader works with conjugated verbs. How does it do that?

Sorry for my bad English, I'm French.

BoboTiG added a commit that referenced this issue Nov 8, 2020
@lasconic
Collaborator

lasconic commented Nov 8, 2020

> The only way to reach 100% success is to recover every verb in all its forms; an algorithm cannot guess that "va" refers to "aller". But where can we find this list? The website "https://leconjugueur.lefigaro.fr/" has it, so it exists somewhere...

We already have this information in Wiktionary. The problem is how to store it in the dict file so that it works 100% of the time on Kobo devices.

> Just a question: it seems to me that the original dictionary of the Kobo reader works with conjugated verbs. How does it do that?

It doesn't. My understanding is that, with these commits, the dict generated by this project is the only freely available French dict for Kobo with some support for a large number of conjugated verb forms (1st, 2nd and 3rd groups, provided the first two letters do not differ between the form and the infinitive, and no noun has the same spelling as the form).

> Sorry for my bad English, I'm French.

Aren't we all? ;)

@akorx

akorx commented Nov 8, 2020

Yes, I think we are all French ;)

We are going to try to keep speaking English... so what is the problem? Is the search too long, or the dictionary too big for the storage?

PS:
I often develop in different languages but have never done it via GitHub, so I have not mastered the process... How are the search algorithm and the dictionary developed?

@BoboTiG
Owner Author

BoboTiG commented Nov 8, 2020

> so what is the problem? Is the search too long, or the dictionary too big for the storage?

Hm I am not sure I followed everything right, but there is no problem for now 🤔
Everything is OK for now ;)

We are just talking about how to handle corner cases like "je suis" that should return "être".

> How are the search algorithm and the dictionary developed?

In this project, nothing is developed regarding the search algorithm. We just provide the dictionary, and the Kobo works with it. The search algorithm used is the Kobo's own (few details and other details). To get more comfortable with how the dictionary works, you should have a look at "Trie" and read the resources linked in the previous sentence.

@BoboTiG
Owner Author

BoboTiG commented Nov 8, 2020

Nicolas, have a look at avais with its template:

{{fr-verbe-flexion|grp=3|'=oui|ind.i.1s=oui|ind.i.2s=oui|avoir}}

It returns "'=oui" but "avoir" is expected.
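
One way to avoid picking up "'=oui" is to skip every named argument (anything containing "="). The sketch below is only an assumption about a possible fix, not the code that was merged: `infinitive_from_flexion` is a hypothetical helper, and it does not handle nested templates or links.

```python
def infinitive_from_flexion(template):
    """Return the first unnamed argument of a fr-verbe-flexion template.

    Named arguments such as "grp=3" or "'=oui" contain "=" and are skipped.
    Returns None when the text is not a fr-verbe-flexion template.
    """
    text = template.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        return None
    name, *args = text[2:-2].split("|")
    if name.strip() != "fr-verbe-flexion":
        return None
    for arg in args:
        if "=" not in arg:
            return arg.strip()
    return None


avais = "{{fr-verbe-flexion|grp=3|'=oui|ind.i.1s=oui|ind.i.2s=oui|avoir}}"
print(infinitive_from_flexion(avais))
```

The position of the infinitive no longer matters with this approach, since only the presence of "=" decides whether an argument is skipped.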

@akorx

akorx commented Nov 8, 2020

OK, I will try to have a look... The only solution to find "suis" is to use exceptions. Sometimes a word exists both as a verb form and as a noun, like "court" (verb: "courir"; noun: "court" vs "long"). So you have to present the two definitions, why not in two paragraphs if we can do it: the first coming from the noun part of the dictionary and the second from the conjugated verbs part.

I will also try the latest update of the dictionary to see the result.

@lasconic
Collaborator

lasconic commented Nov 8, 2020

> {{fr-verbe-flexion|grp=3|'=oui|ind.i.1s=oui|ind.i.2s=oui|avoir}}
>
> It returns "'=oui" but "avoir" is expected.

Done. See PR #204

@BoboTiG
Owner Author

BoboTiG commented Nov 8, 2020

I am closing this issue since the main work has been merged. It will be easier to follow up with specific issues (if needed).
