Inadequate results in words separated by hyphen/minus #20

pintassilgo · 2016-05-27T20:23:33Z

Page: https://addons.mozilla.org/en-US/firefox/files/browse/440201/file/files/foreground.js

Search for "tw" finds "font-weight";
search for "font-w" highlights "font-wei";
search for "t-w" finds "position.left - wndScrollLeft";
search for "x-wi" highlights "x-width".

piroor · 2016-05-28T01:50:40Z

It is caused by the "Ignore modifiers of latin letters" option (enabled by default). The expanded versions of t are defined in the dictionary as:

https://github.com/piroor/xulmigemo/blob/master/dics/latin-letters-with-marks.txt#L23

t [tţťŧŢŤŦ]|t[¸ˇ-]

t- is regarded as a variation of Ŧ. Thus, an input tw produces a regular expression:

TWA|TWX|Twain|Tweed|Tweedledee|Tweedledum|Twila|Twinkie|Twp|Twyla|twaddle|twaddler|twain|twang|twangy|twas|tweak|twee|tweed|tweediness|tweedy|tween|tweet|tweeter|tweeze|tweezer|twelfth|twelfths|twelve|twelvemonth|twelvemonths|twentieths|twenty|twerp|twice|twiddle|twiddler|twiddly|twig|twigged|twigging|twiggy|twilight|twilit|twill|twin|twine|twiner|twinge|twinkle|twinkler|twinkling|twinkly|twinned|twinning|twirl|twirler|twirling|twirly|twist|twisted|twister|twists|twisty|twit|twitch|twitchy|twitted|twitter|twitterer|twittery|twitting|twixt|two|twofer|twofold|twopence|twopenny|twosome|twp|([tţťŧŢŤŦ]|t[¸ˇ-])([wẁẃẅŵẀẂẄŴ]|w[ˋ`ˊ´¨ˆ^])
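The per-letter expansion at the tail of that expression can be sketched as follows. This is only an illustration: `letterPatterns`, `escapeRegExp` and `expandInput` are hypothetical names, the two patterns are copied from the expression above rather than read from the dictionary file, and the dictionary words that precede them are left out.

```javascript
// Hypothetical per-letter table. The two patterns are copied from the tail of
// the regular expression quoted above; the real add-on reads its patterns from
// a dictionary file instead.
const letterPatterns = {
  t: "([tţťŧŢŤŦ]|t[¸ˇ-])",
  w: "([wẁẃẅŵẀẂẄŴ]|w[ˋ`ˊ´¨ˆ^])",
};

// Escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: expand each input character into its pattern,
// falling back to the escaped character when the table has no entry.
function expandInput(input) {
  return Array.from(input)
    .map((ch) => letterPatterns[ch] ?? escapeRegExp(ch))
    .join("");
}

const expanded = new RegExp(expandInput("tw"));
const match = "font-weight".match(expanded);
console.log(match[0]); // "t-w"
```

Because `t[¸ˇ-]` accepts a literal hyphen after `t`, the two-letter input `tw` can match the three characters `t-w`.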

You'll see the same result by running XMigemoCore.getRegExp('tw') in the browser console or the Scratchpad. The rule also matches text that includes hyphens, for example:

var matcher = new RegExp("TWA|TWX|Twain|Tweed|Tweedledee|Tweedledum|Twila|Twinkie|Twp|Twyla|twaddle|twaddler|twain|twang|twangy|twas|tweak|twee|tweed|tweediness|tweedy|tween|tweet|tweeter|tweeze|tweezer|twelfth|twelfths|twelve|twelvemonth|twelvemonths|twentieths|twenty|twerp|twice|twiddle|twiddler|twiddly|twig|twigged|twigging|twiggy|twilight|twilit|twill|twin|twine|twiner|twinge|twinkle|twinkler|twinkling|twinkly|twinned|twinning|twirl|twirler|twirling|twirly|twist|twisted|twister|twists|twisty|twit|twitch|twitchy|twitted|twitter|twitterer|twittery|twitting|twixt|two|twofer|twofold|twopence|twopenny|twosome|twp|([tţťŧŢŤŦ]|t[¸ˇ-])([wẁẃẅŵẀẂẄŴ]|w[ˋ`ˊ´¨ˆ^])");
document.body.textContent.match(matcher); // => Array [ "t-w", "t-", "w" ]

You'll see the same result in the web console.

After you turn off the checkbox and restart, an input tw produces a regular expression:

TWA|TWX|Twain|Tweed|Tweedledee|Tweedledum|Twila|Twinkie|Twp|Twyla|twaddle|twaddler|twain|twang|twangy|twas|tweak|twee|tweed|tweediness|tweedy|tween|tweet|tweeter|tweeze|tweezer|twelfth|twelfths|twelve|twelvemonth|twelvemonths|twentieths|twenty|twerp|twice|twiddle|twiddler|twiddly|twig|twigged|twigging|twiggy|twilight|twilit|twill|twin|twine|twiner|twinge|twinkle|twinkler|twinkling|twinkly|twinned|twinning|twirl|twirler|twirling|twirly|twist|twisted|twister|twists|twisty|twit|twitch|twitchy|twitted|twitter|twitterer|twittery|twitting|twixt|two|twofer|twofold|twopence|twopenny|twosome|twp|tw

This doesn't match text like t-w.

piroor · 2016-05-28T01:55:40Z

To be honest, I'm not familiar with actual use cases of such special characters. If t- (T-) never appears as an alternative to ŧ (Ŧ), I think I should remove the pattern from the dictionary.

pintassilgo · 2016-05-28T03:40:04Z

"Ignore modifiers of latin letters" is THE feature of XULmigemo for me as a native speaker of Brazilian Portuguese (regex is useful too, but less so). Not having it is a serious flaw of Firefox, so thank you for this extension. I know Chrome does this by default. I don't have it installed, but surely it wouldn't find "t-w" when searching for "tw", which I believe is the appropriate behavior. Fastest Search also works properly ("ignore diacritics" option), but it's intrusive and doesn't work in Places or the urlbar.

Maybe "t-w" should find "Ŧw" (I was also unaware of Ŧ), but "tw" shouldn't find "t-w".
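For comparison, the "ignore diacritics" behavior described here can be approximated with Unicode normalization. The sketch below is a different technique from XUL/Migemo's dictionary patterns and is not how Chrome or Fastest Search are known to work; `foldDiacritics` and `foldedIncludes` are hypothetical names. Note that Ŧ has no canonical decomposition, so this approach leaves it as a distinct letter rather than mapping it to t.

```javascript
// Decompose characters (NFD), drop combining marks, then lowercase.
function foldDiacritics(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

// Hypothetical helper: plain substring search on the folded text.
function foldedIncludes(haystack, needle) {
  return foldDiacritics(haystack).includes(foldDiacritics(needle));
}

console.log(foldedIncludes("Esporte Clube", "esporte çl")); // true
console.log(foldedIncludes("font-weight", "tw")); // false
```

With this approach "tw" never matches "t-w", because the hyphen is an ordinary character and is not treated as a modifier.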

Please also note that it highlights more than expected: "t-w" highlights "t-wei", "x-wi" highlights "x-width", and so on.

Another example: https://en.wikipedia.org/wiki/Cruzeiro_Esporte_Clube
"esporte çl" highlights "Esporte Club", two chars more.

piroor · 2016-05-28T08:34:18Z

> Maybe "t-w" should find "Ŧw" (I was also unaware of Ŧ), but "tw" shouldn't find "t-w".

OK, I've removed such patterns from the generated regular expressions in e216063. Thank you for the advice!

> Please also note that it highlights more than expected: "t-w" highlights "t-wei", "x-wi" highlights "x-width", and so on.

It is caused by the dictionary-assisted search feature. As I commented at #20 (comment), XUL/Migemo lists terms extracted from the dictionary, so you'll see such expanded matching results.
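A minimal sketch of why listed dictionary terms widen the highlight, assuming the terms come before the literal input as in the expressions quoted earlier in this thread. `buildDictionaryPattern` is a hypothetical helper, and the term list for "cl" is invented for illustration; whether the en-US dictionary actually contains these words is an assumption.

```javascript
// Escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: dictionary terms first, the literal input last.
function buildDictionaryPattern(input, terms) {
  return [...terms, input].map(escapeRegExp).join("|");
}

// Hypothetical, shortened term list for the input "cl".
const pattern = buildDictionaryPattern("cl", ["clan", "club"]);
const highlighted = "Esporte Clube".match(new RegExp(pattern, "i"))[0];
console.log(highlighted); // "Club"
```

The alternation tries "club" before the bare "cl", so the match covers four characters even though only two were typed.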

XUL/Migemo was initially developed to assist incremental search for Japanese users, and such dictionary-assisted search is essential for us. In Japanese text, the same term can appear in different forms. For example, "Japan" can be "nihon", "にほん", "ニホン", or "日本". Moreover, in Japanese, the input "nihon" ("ni-hon") can also mean "double", so we may want to find more terms like "2本" and "二本" from the same input. XUL/Migemo therefore generates a large regexp like "nihon|にほん|ニホン|日本|2本|二本" and searches the webpage for it. Thus we don't need to type the exact term: we can search for various terms via simple ASCII input.
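The "nihon" example can be sketched like this. The `dictionary` map and `buildPattern` helper are hypothetical and hold only the terms named above; the real dictionary format and lookup are not shown in this thread.

```javascript
// Hypothetical dictionary: one ASCII reading mapped to the written forms
// listed in the paragraph above.
const dictionary = new Map([
  ["nihon", ["にほん", "ニホン", "日本", "2本", "二本"]],
]);

// Escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: the input itself plus every dictionary term for it.
function buildPattern(input) {
  const terms = dictionary.get(input) ?? [];
  return [input, ...terms].map(escapeRegExp).join("|");
}

const nihonPattern = buildPattern("nihon");
const found = "日本でペンを二本買った".match(new RegExp(nihonPattern, "g"));
console.log(found); // [ "日本", "二本" ]
```

A single ASCII input finds both written forms in the sample sentence without the user typing either of them.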

Sadly, the English dictionary is not designed to improve the search experience the way the Japanese one does. (I added the en-US mode just to pass the editors' review on AMO.) If you have any ideas for improving the search experience with a well-designed dictionary, please feel free to send a pull request modifying it: https://github.com/piroor/xulmigemo/tree/master/dics/en-US :)

piroor added a commit that referenced this issue May 28, 2016

Don't match to un-normalized versions of latin characters with the "Ignore modifiers of latin letters" option #20

73e8b32
