French dictionary (apostrophe case) #109

Open
menny opened this issue Mar 25, 2013 · 55 comments

@menny
Member

menny commented Mar 25, 2013

https://code.google.com/p/softkeyboard/issues/detail?id=573

Comment #7 explains everything: the apostrophe and the hyphen should be considered normal word separators.

@ghost ghost assigned menny Mar 25, 2013
@bryanparadis

Hey Menny,

I just started digging through the code to see how the application is structured and where I could add a check for the French language that treats the ' as a separator. I decided to check whether this had already been listed as an issue :)

👍

Regards,

Bryan

@menny
Member Author

menny commented Apr 2, 2013

Good thinking :-)

Actually, the fix is in the French Language Pack, and not in ASK. I'll handle that soon.

@bryanparadis

What mechanism in the French Language Pack are you going to use to fix the issue?


@bryanparadis

Hey Menny,

I have been thinking quite a bit about this issue with word separators. Take French, for example: there are two common ones, - and '.

Examples:
aujourd'hui (one word that needs to be considered for completion, with ' as an inner letter). It is not in the regular dictionary, but it does work once added, because ' is treated as an inner letter.
peut-être (one word that needs to be considered for completion, with - as an inner letter). I miss this one a lot, because once I reach the - the state changes back to start-of-word.

Currently ASK treats the hyphen as a word separator: after the hyphen the state changes back to the start of a word, unlike with the '. That creates two problems, and it complicates your implementation of checking inner letters against a list, if I am following the source correctly. I was looking at these areas:

https://github.com/AnySoftKeyboard/AnySoftKeyboard/blob/master/src/com/anysoftkeyboard/AnySoftKeyboard.java#L1748-L1757
https://github.com/AnySoftKeyboard/AnySoftKeyboard/blob/master/src/com/anysoftkeyboard/AnySoftKeyboard.java#L2550-L2583

Here are some examples of the usage that complicates the current
implementation...

Examples:

Veux-tu (two words; a question inversion that is used often, e.g. Veux-tu manger? "Do you want to eat?"). This currently works because the hyphen is a separator.

D'acheter (two words forming a contraction, in this case of the preposition de and the verb acheter, "to buy", e.g. Je viens d'acheter quelque chose. "I just bought something."). Currently, after the ' it continues looking for suggestions instead of starting a new word, because ' is listed as an inner letter, I think.

So, as both sets of examples show, hyphens and apostrophes need to work both ways. It looks like another state needs to be added, one that checks whether any compound noun suggestions are available and, if not, starts a new word. Then a compound could at least be used if it were manually added to the dictionary. I think I have managed to find my way around to where things are being done, but then again I may be well off the mark. It is after 6am and I am dog tired.

Regards,

Bryan Paradis

@menny
Member Author

menny commented Apr 4, 2013

OMG! Are you kidding me!

Ok, what if:
Add - and ' to the list of separators. But if, when the user types the separator, there is a word in the dictionary that starts with the typed word plus the separator, then I won't treat it as a separator?
For example:
If the user types aujourd', the ' will not be treated as a separator because there is a word aujourd'hui in the dictionary, but if the user types Veux-, the - will be treated as a separator because no dictionary word starts with Veux-.
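
A minimal sketch of that rule, written in Python for brevity rather than ASK's Java. The function name `is_word_separator`, the plain set used as a dictionary, and the linear prefix scan are all assumptions made for illustration; none of this is ASK's actual code, and a real keyboard would use an indexed prefix lookup.

```python
# Hypothetical sketch of the rule proposed above; not ASK's actual implementation.
AMBIGUOUS_SEPARATORS = {"-", "'"}


def is_word_separator(typed_word, char, dictionary):
    """Return True if `char`, typed right after `typed_word`, should end the word."""
    if char not in AMBIGUOUS_SEPARATORS:
        # Illustrative fallback: anything that is not a letter or digit separates words.
        return not char.isalnum()
    prefix = (typed_word + char).lower()
    # Keep composing only if some dictionary word continues past the typed prefix.
    return not any(
        word.lower().startswith(prefix) and len(word.lower()) > len(prefix)
        for word in dictionary
    )


words = {"aujourd'hui", "peut", "peut-être", "veux", "tu"}
print(is_word_separator("aujourd", "'", words))  # False: aujourd'hui exists
print(is_word_separator("Veux", "-", words))     # True: no word starts with "veux-"
print(is_word_separator("Peut", "-", words))     # False: peut-être exists
```

The third call is the awkward case: after Peut- the rule keeps composing because of peut-être, so something else still has to recover when the user goes on to type Peut-il.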

@bryanparadis

I'll have to write down more examples and go through them to work out what the logic is.

For example: Peut-il m'aider? vs Peut-être

Have a look here as well for the different kinds of compounds in French. Any compound noun built on a verb could also end up in an inversion.

http://french.about.com/od/grammar/a/compoundnouns_2.htm


@bryanparadis

I deleted my earlier comment, which was submitted by email and lost its formatting. I have reposted it below, properly formatted.


Current functionality Examples:

As separator:
pe = le
peu = peut | leur | peuvent
peut = peut | leur | peur | peut-être
peut- = (new word)
peut-ê = (below suggestion threshold?)
peut-êt = êt | être | tête | vêt | êtes (below auto-correct threshold?)

Simulated as inner letter:
pe = le
peu = peut | leur | peuvent
peut = peut'être | leur | peur
peut' = peut'être | leur | peur
peut'ê = peut'être
etc

Almost all the time, compounds with hyphens are going to be a verb followed by a pronoun, as in the French imperative. That means the second word is going to be very simple, 1-4 letters: le, la, les, y, en, moi, toi, il, lui, elle, nous, vous, etc.

Allez-y! Go there!
Manges-en! Eat some!
Dites-le-lui! Tell him it!

It is probably more important to guess a compound noun than the second word, since the length of the compound makes autocorrecting it more beneficial. I think there is a way to do both.

More examples:

(Expression compound noun vs verb inversion)
Ex: Peut-être vs Peut-il
Peut-être
Peut- = Peut-être
Peut-êt = Peut-être
Peut-le = le (Peut-être is no longer a possibility, so autocorrect should suggest le)

(triple compound nouns)
Ex: va-et-vient
va- = va-et-vient
va-et- = va-et-vient
va-et-b = va-et-vient (b is next to v, so the suggestion function should still work it out, right?)

(triple compound verb inversion due to vowel phonetic rules va-il = va-t-il)
Ex: va-t-il
va-t = t
va-t- = new word
va-t-il = il

(triple compound expression made of the demonstrative pronoun ce, the verb est, the preposition à, and the infinitive dire)
Ex: c'est-à-dire

("I am going to buy it", where l' is a contraction of le or la before the vowel a in acheter)
Ex: Je vais l'acheter

Correction Scenario:

(Typing error before the hyphen in a compound noun, correctable by key proximity)
Ex: Peut-être
Prut + - = Peut-
Peut- + space = Peut-être (should be the first compound suggestion)

(Correctable typing error by key proximity before hyphen in non compound noun but verb inversion)
Ex: Peut-il
Prut + - = Peut-
Peut- = Peut-être
Peut-i =
Peut-il = il

(Autocorrection with a contraction, if we were to treat apostrophes as separators like hyphens. French contractions)
Ex: Jusqu'à, jusqu'alors, jusqu'ici, jusqu'où, jusqu'au, etc.
Jisqu + ' = jusque (because jusqu' is not in the dictionary, you run into a problem here)

Conclusions:

a) Autocorrection when you press the hyphen in French works OK and may result in a new autocorrect suggestion of a compound noun, as seen in the Peut-être correction scenario: Prut + - = Peut- and Peut- = Peut-être.

Works

b) Autocorrection when you press the apostrophe in French would be a problem, unless you have solved this in your dictionary. It will autocorrect to the full form plus the apostrophe if the dictionary does not list the word in its contracted form: Jusque vs Jusqu', or de and d', que and qu', etc.

Jusque = Jusqu'à, jusqu'alors, jusqu'ici, jusqu'où, jusqu'au, etc
Puisque = Puisqu'il Puisqu'elle, puisqu'ils, puisqu'elles
Quelque = Quelqu'un Quelqu'une (Though these two really should be in the dictionary)
Que = Qu'il, Qu'elle, Qu'ils, Qu'elles, etc

Problem: Could potentially be fixed with a dictionary that lists the contracted form of the first word, e.g. Que and Qu'.
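
One way to read that fix is to generate the contracted entries rather than maintain them by hand. The Python sketch below is illustrative only: the `ELIDABLE` set and the helper names are hypothetical, and the set is a short sample, not a complete statement of French elision rules.

```python
# Hypothetical helper: derive contracted (elided) entries for a word list.
# The set below is a small illustrative sample, not an exhaustive rule set.
ELIDABLE = {"je", "me", "te", "se", "le", "la", "de", "ne", "ce",
            "que", "jusque", "puisque", "lorsque", "quoique"}


def elided_form(word):
    """Return the contracted form (final vowel replaced by ') or None."""
    if word.lower() in ELIDABLE:
        return word[:-1] + "'"
    return None


def with_contractions(words):
    """Return the words plus the elided form of every elidable entry, in order."""
    result = []
    for word in words:
        result.append(word)
        contracted = elided_form(word)
        if contracted is not None:
            result.append(contracted)
    return result


print(with_contractions(["que", "jusque", "maison"]))
# ['que', "qu'", 'jusque', "jusqu'", 'maison']
```

With jusqu' present in the list, a typo such as Jisqu followed by ' has a contracted candidate to correct to, instead of only jusque.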

c) Separators in their current form can be problematic if there are too many other higher-priority autocorrect suggestions before you reach the hyphen in the word, especially if you make a typo.

Problem: Could be remedied by entering a different state when a separator is encountered, instead of just resetting to a new word: a state that continues to make suggestions past the separator.

Final musings:

Separators work better than inner letters as far as autocorrect goes, because at least you can autocorrect two or more words in a row and still end up with the compound, although rather inefficiently. This is because almost all the parts of compound nouns should already be in the dictionary as single words, except for vowel contractions: qu', puisqu', etc.

Having both separators and inner letters causes some problems. Wouldn't it be better to let the suggestion engine keep checking the whole string against the dictionary at and after the separator, and let it decide whether any suggestions or possible compound nouns are left? Once it concludes there are none, it should start searching only on the part of the string after the separator. Can you think of any reason this would cause problems?
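
That idea can be sketched as follows (Python, illustrative only). `lookup_prefix`, `has_completion`, and the set-based dictionary are assumptions; the function returns the part of the typed string that the suggestion engine would search with.

```python
# Hypothetical sketch: keep matching the whole string across separators, and
# fall back to the text after a separator once no dictionary word matches.
SEPARATORS = "-'"


def has_completion(prefix, dictionary):
    """True if any dictionary word starts with `prefix` (case-insensitive)."""
    prefix = prefix.lower()
    return any(word.lower().startswith(prefix) for word in dictionary)


def lookup_prefix(typed, dictionary):
    """Return the longest trailing part of `typed` that still has completions."""
    candidate = typed
    while True:
        if has_completion(candidate, dictionary):
            return candidate
        # Index of the first separator still present in the candidate, if any.
        positions = [candidate.find(sep) for sep in SEPARATORS]
        cut = min((pos for pos in positions if pos != -1), default=-1)
        if cut == -1:
            return candidate  # nothing left to drop; leave it to autocorrect
        candidate = candidate[cut + 1:]


words = {"peut", "peut-être", "il", "le", "va", "et", "vient", "va-et-vient", "t"}
print(lookup_prefix("Peut-êt", words))  # Peut-êt
print(lookup_prefix("Peut-il", words))  # il
print(lookup_prefix("va-et-", words))   # va-et-
print(lookup_prefix("va-t-il", words))  # il
```

An empty result (for example after typing Veux-) means no compound is possible and a new word starts.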

I mean, you could optionally make autocorrect more intelligent by adding rules that depend on which language is loaded. Maybe for another time. I can see a lot of things you could look for and check, depending on vowels and all sorts of things!

Other improvements

Keyboard keys

  • 'e' default long press character is è, which is not efficient considering there is a dedicated è key. I would love to see this changed to ê, as it is used often: être, peut-être, tête, bête, etc.
  • 'a' default long press character is à, which is not efficient either considering it has a dedicated à key. It should probably be changed to â, as it would be more common than æ.
  • Make hyphen more accessible somehow
  • Accents other than the ones below could probably be removed from long press keys:

Acute accent (é)
Grave accent (à, è, ù)
Circumflex (â, ê, î, ô, û)
Diaeresis or tréma (ë, ï, ü, ÿ)
Cedilla (ç)
The Tilde Diacritical Mark (ñ)
The Two Ligatures (œ) and (æ)

  • Add long press popup on backspace to delete text because backspace just takes too long if you decide your long story isn't worth sending :)

Anyway! Hope it helps. Maybe I can take a poke at the layout changes or whatever for French usage. I tried to make this as clear as possible with formatting.

Cheers,

Bryan Paradis

@Evpok

Evpok commented Mar 1, 2015

Any news on this? I didn't find the French language pack sources, but in the store version the issue is still there.

@breversa

breversa commented Mar 8, 2015

On a related note, I'd love to build a newer, better version of the French dictionary. I find the current one lacking in many ways; for instance, several common conjugations are missing.

Any how-to would be helpful! :-)

@Evpok

Evpok commented Mar 8, 2015

@breversa If you do, count me in.

@breversa

I did a bit of research and here are my findings so far:

So I guess all that's left to do is build a French language pack based on Dicollecte's lexique.

If only I could get my hands on that how-to I once found…

@menny
Member Author

menny commented Mar 15, 2015

That's very interesting!

What is your programming and Android Development experience?

@breversa

Though I have a developer's degree (mainly Java, C and PHP), it dates back to the early 2000s and I've never thought of myself as a developer. So no real experience, and definitely none with Android, but I guess I have a semi-functional brain and I like to find, or sometimes even provide, solutions. :-)

What did you have in mind?


@breversa

… and here's the building guide I was thinking of: https://code.google.com/p/softkeyboard/wiki/BinaryDictionaries :-)

@menny
Member Author

menny commented Mar 15, 2015

This is SO outdated!

Basically, https://github.com/AnySoftKeyboard/LanguagePack is the base of language packs (see branches for concrete implementations)
But, I doubt it works right now, because all of the dependencies have changed.

You can have a go at that 😃

@breversa

Yeah, that's what I feared and was about to ask.

I guess the least I can do is provide a French words.xml file built from Dicollecte's data. Would that be enough for you (or anyone else) to update the French language pack?

And BTW, do the elements in words.xml need to be ordered by frequency (the "f" value) or not?

@menny
Member Author

menny commented Mar 15, 2015

Yes, words XML is good enough. Attach it here.
And, yes, f is the frequency
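
For anyone preparing such a file, here is a minimal sketch in Python (the script linked later in this thread uses awk). It assumes the layout the old BinaryDictionaries guide appears to describe: a `wordlist` root with one `w` element per word and the frequency in an `f` attribute. The sample entries, the helper name `build_words_xml`, the 1 to 255 range check, and sorting by descending frequency are all assumptions; the thread does not confirm whether ordering matters.

```python
# Hypothetical generator for a words.xml file; element names are assumed.
import xml.etree.ElementTree as ET


def build_words_xml(entries):
    """Build a <wordlist> tree from (word, frequency) pairs, 1 <= frequency <= 255."""
    root = ET.Element("wordlist")
    # Sorting is a choice made here, not a confirmed requirement of the format.
    for word, freq in sorted(entries, key=lambda e: (-e[1], e[0])):
        if not 1 <= freq <= 255:
            raise ValueError(f"frequency out of range for {word!r}: {freq}")
        node = ET.SubElement(root, "w", {"f": str(freq)})
        node.text = word
    return root


# Made-up frequencies, for illustration only.
entries = [("aujourd'hui", 180), ("de", 255), ("peut-être", 150)]
xml_text = ET.tostring(build_words_xml(entries), encoding="unicode")
print(xml_text)
```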

@breversa

I'm still a beginner regarding GitHub, so… how do you attach files to an issue comment?

@Evpok

Evpok commented Mar 15, 2015

If you are looking for an extensive dictionary, the Lefff is the most complete there is. It doesn't have frequency data, though.

@breversa

Never mind, I think I found a solution: https://my.owndrive.com/public.php?service=files&t=8e5c75acbf7e8766b2eb6efb09d24fa7

And here is the small awk script I wrote to generate the above file: https://my.owndrive.com/public.php?service=files&t=3486b43d01e665b80f25c9d62f7f1007

@breversa

Thanks Evpok. I've had a (lightning-)quick look at the LEFFF.

However, I don't think an extensive dictionary is the best thing for a phone dictionary: it's the most common words that are needed, not necessarily the most exotic ones. YMMV, though, if you have specific needs.

@Evpok

Evpok commented Mar 15, 2015

Just saying :) However it still doesn't help with our tokenisation problem.

@breversa

Oops, yeah, sorry about hijacking the initial issue… :-/

@breversa

Hi Menny, I just wanted to know whether the fr.xml file I provided three posts above is enough for you to generate a new French dictionary?

@xavihernandez

#540

@djibux

djibux commented Dec 15, 2015

Hello. Is someone currently looking into that issue?

@breversa

I'm afraid no one is... 😞

@djibux

djibux commented Dec 15, 2015

Thanks @xavihernandez. Very unfortunately my device isn't supported by CyanogenMod. I use a stock version of android that I have degooglized myself. Google Keyboard was the available keyboard.

@xavihernandez

@djibux Oh yeah. Well, if you use Google Keyboard you will have an idea about AOSP keyboard because they have the same base. I don't really know what is closed-source in Google Keyboard tho...

@menny menny added this to the backlog milestone Jan 12, 2016
@xavihernandez

@menny thanks, hope it gets integrated soon =)

@breversa

Hi @menny, thanks for looking into this. :-)

Shall I provide an updated version of the fr.xml file?

@bryanparadis

It's been a while since I checked back here. Cool stuff! I've been too busy working for a long time now (:

Good luck with the enhancement!

@homlett

homlett commented Oct 28, 2016

It would be amazing if this issue were solved!

I don't know if this is the right place, but speaking of what the French keyboard lacks, guillemets (the French-style quotes « and ») are missing as well.
https://en.m.wikipedia.org/wiki/Guillemet

@stragu

stragu commented Oct 28, 2016

@homlett The azerty keyboard has them with a long press on the "/' key, but having the French guillemets by default would probably be best (i.e. a «/'/» key). And maybe I'm a dreamer, but ideally it would automatically insert a non-breaking space after "«" and before "»" 😉
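
The non-breaking-space wish can be sketched like this (Python, illustrative). The helper name `french_quotes` is hypothetical, and it inserts U+00A0; swapping in the narrow no-break space U+202F is a one-line change.

```python
# Hypothetical sketch: normalise spacing inside French guillemets.
import re

NBSP = "\u00a0"  # no-break space; use "\u202f" for the narrow variant


def french_quotes(text):
    """Put exactly one non-breaking space after « and before »."""
    text = re.sub(r"«[ \u00a0\u202f]*", "«" + NBSP, text)
    text = re.sub(r"[ \u00a0\u202f]*»", NBSP + "»", text)
    return text


print(repr(french_quotes("Il a dit «bonjour» puis « au revoir ».")))
```

A keyboard would more likely insert the space at the moment the quote key is pressed; the post-processing form is used here only because it is easy to run and check.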

@homlett

homlett commented Oct 28, 2016

I can't find them... I'm using AnySoftKeyboard 1.8.195 and the French pack 20111029 (both from F-Droid).

Otherwise, you're right about the French-style quotes key and non-breaking spaces. That would be very nice (like swipes, but that's another issue!).

Edit: Not with the '/' key either:

Only with the bépo keyboard:

@stragu

stragu commented Oct 29, 2016

Ok I figured out what the difference is: we have the exact same software versions, but I think that you are using the regular "common bottom generic row" whereas I am using the "new generation - testing" (go to User interface > Even more... > Common bottom generic row).

And to clarify my previous comment: I was not talking about the / key, but rather using the character as a separator :)

@homlett

homlett commented Oct 29, 2016 via email

@papjul

papjul commented Jan 18, 2017

I’m working on updated French keyboards and an updated dictionary, and I’m running into this issue.
Is ASK based on AOSP keyboard? Because if so, this is how Google is handling different typography in different languages:

@friesenkiwi
Contributor

See AnySoftKeyboard/LanguagePack#12
Maybe the original issue (and some of the follow-ups here ;-)) is already solved with the new dictionary?
@papjul how far did you get with your pack? Do you think you can put up a PR as soon as the resurrected sources have been merged?

@papjul

papjul commented Feb 17, 2017

I have designed French/Belgian and Canadian keyboards (with accented variants), though I'm still waiting for a fix to another issue concerning long-press characters.
I have a working, updated Dicollecte dictionary. I also tried the AOSP dictionary, but it is buggy: words with accented letters are not shown, so we should not use it. The Dicollecte dictionary is better anyway, so I don't really mind for the moment. I also tried to merge the AOSP English and AOSP French dictionaries for people who use both languages, like me, but since AOSP French doesn't work well, we can't use that. I can't merge AOSP English with Dicollecte French because their frequencies are not on the same scale.
I will release the sources when I'm home, but they are still very experimental.

Edit: To answer your question: no, neither the AOSP dictionary nor the Dicollecte dictionary handles the apostrophe case correctly. If every apostrophe combination were included in a dictionary, the dictionary would be insanely huge, so the apostrophe case needs to be fixed first. Meanwhile, I have stopped working on the pack, since I'm using the AOSP keyboard, which handles it correctly; ASK is unusable without this fix.

@breversa

breversa commented Feb 17, 2017 via email

@papjul

papjul commented Feb 17, 2017

That's not how the frequency works in Dicollecte.
If you scale it from 255 down to 1, the most popular French word ends up at 255, the second most popular at 128, and so on, until the 10th most popular is at about 2 or 3. That is because the frequency is calculated from the number of times a word is used in French sentences. The scale in AOSP dictionaries doesn't work like this; it is far more linear.
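
To make the scale mismatch concrete, here is a small Python sketch. The counts are invented for illustration (they are not Dicollecte or AOSP data), and both helper names are hypothetical. Proportional scaling keeps the steep drop-off described above, while rank-based scaling is one possible way to bring two word lists onto a comparable, more linear scale; nothing in this thread says either project does this.

```python
# Hypothetical illustration with made-up counts; not real Dicollecte or AOSP data.
def scale_proportional(counts, top=255):
    """Scale raw counts so the most frequent word gets `top` (minimum 1)."""
    highest = max(counts.values())
    return {w: max(1, round(top * c / highest)) for w, c in counts.items()}


def scale_by_rank(counts, top=255):
    """Spread words evenly between `top` and 1 according to rank only."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    if len(ranked) == 1:
        return {ranked[0]: top}
    step = (top - 1) / (len(ranked) - 1)
    return {w: round(top - i * step) for i, w in enumerate(ranked)}


counts = {"de": 1000, "la": 500, "et": 250, "chat": 10}
print(scale_proportional(counts))  # steep: the tail collapses towards 1
print(scale_by_rank(counts))       # linear in rank
```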

@friesenkiwi
Contributor

Let's concentrate the discussion about a new French Dictionary at AnySoftKeyboard/LanguagePack#12

@anysoftkeyboard-bot
Member

This issue is stale because it has been open 400 days with no activity. Remove stale label or this will be closed in 8 days

@MayeulC

MayeulC commented Feb 27, 2020

Hey there, did anybody make any progress on this issue?

This is still a very annoying issue when writing French (together with #1332 and "ca" being corrected to "cA" instead of "ça": #540 ). To give you a better idea of the scope of the issue, here is a screenshot of (a small part of) my user dictionary after using ASK for about a month on a new phone:

image

This is especially annoying when swiping: since one of the words joined by the apostrophe is usually short, I tend to swipe both in one go, but I am unsure whether the apostrophe is recognized as a character, even when the swiped word exists in the user dictionary.

As a side note, maybe this could be somewhat sidestepped by using the typographic apostrophe on both the keyboard and the dictionary.

@menny menny removed the Stale label Feb 27, 2020
@Silmathoron

Yep, also having issue with this... could somebody tell us how we could contribute to improving this? Would sharing user dict be helpful or is this a matter of adding some kind of grammar rule in a parser?

@jlemonde

I experience the same kind of issue; it is very annoying that apostrophes are not correctly predicted. When I type a word and omit the apostrophe, I would like the keyboard to recognise that there should be one, for instance jai → j'ai, aujourdhui → aujourd'hui, and so on. For now it does not. Is a fix possible?
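
One possible approach to this request, sketched in Python: index the dictionary by each word with its apostrophes removed, so that input typed without them can be mapped back. The function names and the sample word list are hypothetical, and this is not how ASK currently resolves suggestions.

```python
# Hypothetical sketch: suggest apostrophe forms for input typed without them.
from collections import defaultdict

APOSTROPHES = "'\u2019"  # ASCII and typographic apostrophes


def strip_apostrophes(word):
    """Remove both apostrophe variants from `word`."""
    return "".join(ch for ch in word if ch not in APOSTROPHES)


def build_index(dictionary):
    """Map each lowercased, apostrophe-less key to the words it can stand for."""
    index = defaultdict(list)
    for word in sorted(dictionary):
        index[strip_apostrophes(word).lower()].append(word)
    return index


def suggest(typed, index):
    """Return dictionary words matching `typed` once apostrophes are ignored."""
    return index.get(strip_apostrophes(typed).lower(), [])


index = build_index({"j'ai", "aujourd'hui", "lait", "l'ait"})
print(suggest("jai", index))         # ["j'ai"]
print(suggest("aujourdhui", index))  # ["aujourd'hui"]
print(suggest("lait", index))        # ["l'ait", 'lait']
```

The last call shows why ranking would still matter: lait and l'ait are both valid, so frequency has to decide between them. The index also only helps for apostrophe forms that are actually in the dictionary.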

@PrSunflower

Hello! Has the issue been solved?

I typed m'avertir correctly but the AZERTY keyboard autocorrected into l'avertir.

Screenshot_20200814-162440_1

@MayeulC

MayeulC commented Aug 14, 2020

Hello @PrSunflower. No, it hasn't, unfortunately. You have probably typed "l'avertir" often enough that it is in your user dictionary, so ASK suggests it.

@PrSunflower

Hi @MayeulC ,

Oh too bad. Thanks for the suggestion, but I have checked my AZERTY dictionary and l'avertir is not in the list.

Screenshot_20200815-111440_1
