Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deps updates since commit 86ee9c6 cause CJK characters *fuzzy* search throw error message of "undefined" #87

Closed
c0001 opened this issue Mar 23, 2023 · 12 comments

Comments

@c0001
Copy link
Contributor

c0001 commented Mar 23, 2023

Hi maintainer:

For recent 1.8.5 release checking out on my usage side, all of fuzzy search method (i.e. any search mode unless precise) thru CJK (i.e. chinese, japan and korean) prefix gives me an error message of:

Error Message:  undefined

without any candidates listed anymore.

I've spent some time to investigate the problem by checking out recently git commits for locating the final no-problem commit, it is 98ee21e. And I've compared the latest commit upates content again with that no-problem commit to found that the only break changes between is the deps updates via commit 86ee9c6.

In my test, when I revert the deps updates via commit 86ee9c6 based on latest commit (i.e. the main branch), the problem is gone.

I've not test which dep's updating caused this issue, hope for founding and helping me how to fix this problem.

Thanks.

@Fannon
Copy link
Owner

Fannon commented Mar 26, 2023

Hi @c0001,

thanks for the report and the investigation!
From what you write, I'm rather sure that the problem is actually coming from the https://github.com/leeoniya/uFuzzy library. There we already have a newer version. I could try a dependency update to that and if that works, I can create a new release. If not, could you consider creating your report at the uFuzzy GitHub issues?

There seems to be one issue already: leeoniya/uFuzzy#14

Could you also send me some text examples which cause the issue? Then I can use this to replicate.

Thanks!
Simon

@c0001
Copy link
Contributor Author

c0001 commented Mar 27, 2023

I've check the leeoniya/uFuzzy#14 issue which says ufuzzy may
need a more complex regexp specified for its option entry while used
for CJK terms based search, and below is my experimentation snippet:

const ufuzzy = require('@leeoniya/ufuzzy')

let opts = {
    unicode: false,
    interSplit: "(\p{Unified_Ideograph=yes})+"
};

let uf = new ufuzzy(opts)

const list = ['Javascript太棒了', 'Javascript素晴らしい', 'Javascript너무훌륭해훌륭해']

let strs = {cn: '太棒', jp: 'らし', kr: '훌륭'}

for (let i in strs) {
    console.log(i + ':' + uf.filter(list, strs[i]));
}

If I don't specify interSplit with CJK full stack range as above
(which I've taken from here which see Unified_Ideograph),
the search is failed as returning null even if I using the latest
version of uFuzzy. It seems uFuzzy can not use CJK chars for
calculating matches.

I just make a little bit try, hope that's useful.

Could you also send me some text examples which cause the issue?
Then I can use this to replicate.

The contents in above code example may be enough?

There we already have a newer version. I could try a dependency
update to that and if that works, I can create a new release

I've tested this lastest commit yet, still throwing same error.

If not, could you consider creating your report at the uFuzzy GitHub issues?

Yeah, by the investigation, I think that should be but may be
duplicated with leeoniya/uFuzzy#14, thus that may be later or
... any suggestion issued to?

Thanks!
Briant

@Fannon
Copy link
Owner

Fannon commented Mar 27, 2023

Hi Briant,

ok, I got it. So if I can fix this issue by just configuring the interSplit regexp, then I'll do that. I'm just not sure on the consequences on this.

But since you obviously can check out and build this project, I'll create a PR and can merge it - before making a release out of it. Then you and I can test a bit.

@Fannon
Copy link
Owner

Fannon commented Mar 27, 2023

Ok, merged it. For me it seemed to work fine - could you also try out and verify?

c0001 added a commit to c0001/search-bookmarks-history-and-tabs that referenced this issue Mar 28, 2023
Issued by Fannon#87, we should provide this to let multilingual searching be
capable, robust and following user side specifications via doing more
tweaks of uFuzzy(@leeoniya/ufuzzy).

See also:
- leeoniya/uFuzzy#14
- https://github.com/leeoniya/uFuzzy#how-it-works
c0001 added a commit to c0001/search-bookmarks-history-and-tabs that referenced this issue Mar 28, 2023
Issued by Fannon#87, we should provide this to let multilingual searching be
capable, robust and following user side specifications via doing more
tweaks of uFuzzy(@leeoniya/ufuzzy).

See also:
- leeoniya/uFuzzy#14
- https://github.com/leeoniya/uFuzzy#how-it-works
c0001 added a commit to c0001/search-bookmarks-history-and-tabs that referenced this issue Mar 28, 2023
Issued by Fannon#87, we should provide this to let multilingual searching be
capable, robust and following user side specifications via doing more
tweaks of uFuzzy(@leeoniya/ufuzzy).

See also:
- leeoniya/uFuzzy#14
- https://github.com/leeoniya/uFuzzy#how-it-works
c0001 added a commit to c0001/search-bookmarks-history-and-tabs that referenced this issue Mar 28, 2023
Issued by Fannon#87, we should provide this to let multilingual searching be
capable, robust and following user side specifications via doing more
tweaks of uFuzzy(@leeoniya/ufuzzy).

See also:
- leeoniya/uFuzzy#14
- https://github.com/leeoniya/uFuzzy#how-it-works
@c0001
Copy link
Contributor Author

c0001 commented Mar 28, 2023

(Sorry for above duplicated commit references caused by my four times rebase, but I don't know how to remove them)

@Fannon

Thanks, that's seems worked for me as well, but I think that's not a usual way since interSplit is not a exposed fixed user option based on uFuzzy's README, and the commentary of interSplit (here) just say that's a term segmentation & punct/whitespace merging and defaults to [^A-Za-z0-9']+ which let me thought of that the p{Unified_Ideograph}=yes regexp is considerred as a divider filter rather than a mactch? If so, the fix of this may not be proper.

Thus, I make a new pull-request to provide a ufuzzyOptions user side customization which let us can do more test and tricks, is that OK?

@leeoniya
Copy link

yes, interSplit should only be used for white-space and punctuation. the default setting excludes apostrophes so that don't does not get matched as don t.

https://stackoverflow.com/questions/4328500/how-can-i-strip-all-punctuation-from-a-string-in-javascript-using-regex

@c0001
Copy link
Contributor Author

c0001 commented Mar 28, 2023

Hi uFuzzy author @leeoniya :

Thanks for issuing the internal mechanism of option interSplit, and I've checked it with the regexp range covering whitespace and unicode punctuations, and worked for most of cases for example run bellow test.js:

const process = require('process')
const ufuzzy  = require('@leeoniya/ufuzzy')

let opts = {
    interSplit: "(^\\s|\\p{P})+",
};

let uf = new ufuzzy(opts)

const list = ['Javascript 太棒了', 'Javascript素晴らしい', 'Javascript너무훌륭해훌륭해']

let strs = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '太 棒', '素\t晴 ら', '훌”륭\t\t해 훌']

let c = 0
for (let i in strs) {
    let j = uf.filter(list, strs[i])
    if (!j) {
        console.log('\'' + strs[i] + '\'' + ': ' + j);
        c++;
    }
}
if (c!=0) {
    console.log("--> Failure of some letter search, Abort!")
    process.exit(1)
} else {
    console.log("--> All of searching available, Done!")
}

This given the result of

>$ node test.js 
--> All of searching available, Done!

But if I changed interSplit regexp to ([^a-zA-Z']|\\s|\\p{P})+ that plusses your defaults, the result is failure of all CJK patterns:

>$ node test_ufuzzy.js 
'太 棒': null
'素     晴 ら': null
'훌”륭          해 훌': null
--> Failure of some letter search, Abort!

That's why?

With checking uppon my previous post in this issue, if I change the interSplit regexp (\p{Unified_Ideograph=yes})+ to either [^a-zA-Z'\p{Unified_Ideograph=yes}]+ or ([^a-zA-Z']|\p{Unified_Ideograph=yes})+ for a compatibility test, the test is still failure with all of cn,jp and kr search result are null, even if I replace [^a-zA-Z] to purely simple "not" logical regexp char classes bracket (such as [^a]). Unless I remove the ^. It seems like the ^ keyword will also affect the result?

In the other hand, can you give more description of uFuzzy's designation of interSplit and either for its sibling intraSplit since I'm confused about what that is based on the understanding of interSplit which you've pointed out. Or return to the original issue, my question is that does uFuzzy can fully support fuzzy searching with extending Latin/Roman alphabet with CJK terms and any existent workaround can be released out here?

Thanks!
Briant

@Fannon
Copy link
Owner

Fannon commented Mar 28, 2023

Thanks @c0001 for working on further clarifying this!

So I'll revert my (then premature) fix in the master branch and wait until we found a good solution that works well for both Latin/Roman and CJK search items.

I do like the idea of just exposing uFuzzyOptions directly for power users. But I hope that we also find a more general, user friendly fix :)

@c0001
Copy link
Contributor Author

c0001 commented Mar 29, 2023

According to #91, We have flexible uFuzzyOptions user
customization place now, so the issue can be halt temporarily.

The problem is for those users are not familiar with uFuzzy
specification that how we make a guide for them to be involved in. Or
based on this discussion of #90 , a simple supportCJK may be
approachable for common user. To my wish, let user-end be with
non-feeling of this change (i.e. find a robust way to make both of
them workable internally) may be Best.

Further more for any one first time dating with this problem, this
previous post
may be useful.

Thanks!
Briant

@Fannon
Copy link
Owner

Fannon commented Mar 29, 2023

@c0001 : If you have any further insights how to best configure uFuzzy for your case, feel free to propose a better README fix or mention. In case we find a solution which works general-purpose, we can either have it in by default or if it comes with drawbacks, then have it as a dedicated option (like enableCJK: true)

Shall we close this or keep it open? Your call.

@c0001
Copy link
Contributor Author

c0001 commented Mar 29, 2023

Of course I will, I've made some changes to that but hadn't got the best way even if following the leeoniya's suggestion to use fully punctuations and whitespace binding to interSplit since the new unterminated chars classes problem is found out. Uh...relax :)

Shall we close this or keep it open? Your call.

I've preferred to halt this issue as Close as not planned since there's no exact fix found here. I halt this now, if any problem is presenting on, whenever reopening is welcom.

@c0001 c0001 closed this as not planned Won't fix, can't repro, duplicate, stale Mar 29, 2023
@Fannon
Copy link
Owner

Fannon commented Mar 30, 2023

Ok, great. Thanks again!
I've now created a new release and submitted it to the stores. Usually this takes a few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants