Inadequate results in words separated by hyphen/minus #20

Open
pintassilgo opened this issue May 27, 2016 · 4 comments

pintassilgo commented May 27, 2016

Page: https://addons.mozilla.org/en-US/firefox/files/browse/440201/file/files/foreground.js

  • Search for "tw" finds "font-weight";
  • search for "font-w" highlights "font-wei";
  • search for "t-w" finds "position.left - wndScrollLeft";
  • search for "x-wi" highlights "x-width".

piroor (Owner) commented May 28, 2016

This is caused by the "Ignore modifiers of latin letters" option (enabled by default). The expanded variants of t are defined in the dictionary as:

https://github.com/piroor/xulmigemo/blob/master/dics/latin-letters-with-marks.txt#L23

t [tţťŧŢŤŦ]|t[¸ˇ-]

t- is treated as a variant of Ŧ. Thus, the input tw produces the following regular expression:

TWA|TWX|Twain|Tweed|Tweedledee|Tweedledum|Twila|Twinkie|Twp|Twyla|twaddle|twaddler|twain|twang|twangy|twas|tweak|twee|tweed|tweediness|tweedy|tween|tweet|tweeter|tweeze|tweezer|twelfth|twelfths|twelve|twelvemonth|twelvemonths|twentieths|twenty|twerp|twice|twiddle|twiddler|twiddly|twig|twigged|twigging|twiggy|twilight|twilit|twill|twin|twine|twiner|twinge|twinkle|twinkler|twinkling|twinkly|twinned|twinning|twirl|twirler|twirling|twirly|twist|twisted|twister|twists|twisty|twit|twitch|twitchy|twitted|twitter|twitterer|twittery|twitting|twixt|two|twofer|twofold|twopence|twopenny|twosome|twp|([tţťŧŢŤŦ]|t[¸ˇ-])([wẁẃẅŵẀẂẄŴ]|w[ˋ`ˊ´¨ˆ^])

You can see the same result by running XMigemoCore.getRegExp('tw') in the browser console or the Scratchpad. The rule also matches text that includes hyphens, for example:

var matcher = new RegExp("TWA|TWX|Twain|Tweed|Tweedledee|Tweedledum|Twila|Twinkie|Twp|Twyla|twaddle|twaddler|twain|twang|twangy|twas|tweak|twee|tweed|tweediness|tweedy|tween|tweet|tweeter|tweeze|tweezer|twelfth|twelfths|twelve|twelvemonth|twelvemonths|twentieths|twenty|twerp|twice|twiddle|twiddler|twiddly|twig|twigged|twigging|twiggy|twilight|twilit|twill|twin|twine|twiner|twinge|twinkle|twinkler|twinkling|twinkly|twinned|twinning|twirl|twirler|twirling|twirly|twist|twisted|twister|twists|twisty|twit|twitch|twitchy|twitted|twitter|twitterer|twittery|twitting|twixt|two|twofer|twofold|twopence|twopenny|twosome|twp|([tţťŧŢŤŦ]|t[¸ˇ-])([wẁẃẅŵẀẂẄŴ]|w[ˋ`ˊ´¨ˆ^])");
document.body.textContent.match(matcher); // => Array [ "t-w", "t-", "w" ]

You'll see the same result in the web console.

After you turn off the checkbox and restart, the input tw produces this regular expression:

TWA|TWX|Twain|Tweed|Tweedledee|Tweedledum|Twila|Twinkie|Twp|Twyla|twaddle|twaddler|twain|twang|twangy|twas|tweak|twee|tweed|tweediness|tweedy|tween|tweet|tweeter|tweeze|tweezer|twelfth|twelfths|twelve|twelvemonth|twelvemonths|twentieths|twenty|twerp|twice|twiddle|twiddler|twiddly|twig|twigged|twigging|twiggy|twilight|twilit|twill|twin|twine|twiner|twinge|twinkle|twinkler|twinkling|twinkly|twinned|twinning|twirl|twirler|twirling|twirly|twist|twisted|twister|twists|twisty|twit|twitch|twitchy|twitted|twitter|twitterer|twittery|twitting|twixt|two|twofer|twofold|twopence|twopenny|twosome|twp|tw

This doesn't match text like t-w.
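The difference between the two settings can be reproduced with a shortened pattern. This is only a sketch: it keeps the trailing "letters with marks" alternative from the regular expression above and drops the dictionary words, and the variable names are made up for illustration.

```javascript
// Trailing alternative generated while "Ignore modifiers of latin letters" is on
// (dictionary words omitted for brevity).
const withModifiers = /([tţťŧŢŤŦ]|t[¸ˇ-])([wẁẃẅŵẀẂẄŴ]|w[ˋ`ˊ´¨ˆ^])/;

// With the option off, the trailing alternative is just the literal input.
const withoutModifiers = /tw/;

const sample = "font-weight";

// "t[¸ˇ-]" accepts "t-" as a variant of t, so "t-w" is found.
const hit = sample.match(withModifiers);
console.log(hit[0]); // "t-w"

// The literal pattern needs "t" directly followed by "w".
console.log(withoutModifiers.test(sample)); // false
```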

piroor (Owner) commented May 28, 2016

To be honest, I'm not familiar with the actual use cases of such special characters. If t- (T-) never appears as an alternative to ŧ (Ŧ), I think I should remove the pattern from the dictionary.

pintassilgo (Author) commented

"Ignore modifiers of latin letters" is THE feature of XULmigemo for me as a native speaker of Brazilian Portuguese (regex is useful too, but less so). Not having it is a serious flaw of Firefox, so thank you for this extension. I know Chrome does this by default. I don't have it installed, but surely it wouldn't find "t-w" when searching for "tw", which I believe is the appropriate behavior. Fastest Search also works properly ("ignore diacritics" option), but it's intrusive and doesn't work on Places and the urlbar.

Maybe "t-w" should find "Ŧw" (I was also unaware of Ŧ), but "tw" shouldn't find "t-w".

Please also note that it's highlighting more than expected. "t-w" highlights "t-wei", "x-wi" highlights "x-width" and so on.

Another example: https://en.wikipedia.org/wiki/Cruzeiro_Esporte_Clube
"esporte çl" highlights "Esporte Club", two characters more than expected.

piroor added a commit that referenced this issue May 28, 2016

piroor (Owner) commented May 28, 2016

> Maybe "t-w" should find "Ŧw" (I was also unaware of Ŧ), but "tw" shouldn't find "t-w".

OK, I've removed such patterns from the generated regular expressions in e216063. Thank you for the advice!

> Please also note that it's highlighting more than expected. "t-w" highlights "t-wei", "x-wi" highlights "x-width" and so on.

This is caused by the dictionary-assisted search feature. As I commented at #20 (comment), XUL/Migemo lists terms extracted from the dictionary, so you'll see such expanded matches.
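A small sketch of why the highlight can extend past the typed text, assuming a much shorter word list than the real dictionary produces: the dictionary terms are listed before the literal input, and a JavaScript alternation takes the first alternative that matches at a given position.

```javascript
// Shortened stand-in for the generated pattern: dictionary terms first,
// the literal input "tw" last.
const dictionaryAssisted = /twelve|twenty|twist|tw/;

// What a purely literal search for the same input would use.
const literalOnly = /tw/;

const text = "twenty-two";

// "twelve" fails at index 0, "twenty" succeeds, so the whole word is matched.
const assistedMatch = text.match(dictionaryAssisted)[0];
const literalMatch = text.match(literalOnly)[0];

console.log(assistedMatch); // "twenty"
console.log(literalMatch); // "tw"
```

The matched (and therefore highlighted) range is the dictionary term, not just the two characters that were typed.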

XUL/Migemo was initially developed to assist incremental search for Japanese users, and this kind of dictionary-assisted search is essential for us. In Japanese text, the same term can appear in different forms. For example, "Japan" can be written "nihon", "にほん", "ニホン", or "日本". Moreover, in Japanese, the input "nihon" ("ni-hon") can also mean "double", so we may want to find terms like "2本" and "二本" from the same input. XUL/Migemo therefore generates a large regexp like "nihon|にほん|ニホン|日本|2本|二本" and searches the webpage for it. Thus we don't need to type the exact term: we can search for various terms via simple ASCII input.
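The idea can be sketched in a few lines. This is not XUL/Migemo's actual code: the `dictionary` object, `escapeRegExp`, and `buildPattern` are hypothetical names, and the dictionary holds only the one entry from the example above.

```javascript
// Hypothetical mini-dictionary: ASCII input -> candidate written forms.
const dictionary = {
  nihon: ["にほん", "ニホン", "日本", "2本", "二本"],
};

// Escape regexp metacharacters so every candidate is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Join the raw input and all of its dictionary candidates into one alternation.
function buildPattern(input) {
  const candidates = [input, ...(dictionary[input] || [])];
  return new RegExp(candidates.map(escapeRegExp).join("|"));
}

const pattern = buildPattern("nihon");
console.log(pattern.source); // nihon|にほん|ニホン|日本|2本|二本
console.log(pattern.test("日本の首都")); // true
console.log(pattern.test("鉛筆を二本")); // true
```

An input with no dictionary entry falls back to a pattern containing only the literal input.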

Sadly, the English dictionary is not designed to improve the search experience the way the Japanese one does. (I only added the en-US mode to pass the editors' review on AMO.) If you have ideas for improving the search experience with a better-designed dictionary, please feel free to send a pull request modifying https://github.com/piroor/xulmigemo/tree/master/dics/en-US :)
