Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined" #87

c0001 · 2023-03-23T14:39:00Z

Hi maintainer:

For recent 1.8.5 release checking out on my usage side, all of fuzzy search method (i.e. any search mode unless precise) thru CJK (i.e. chinese, japan and korean) prefix gives me an error message of:

Error Message:  undefined

without any candidates listed anymore.

I've spent some time to investigate the problem by checking out recently git commits for locating the final no-problem commit, it is 98ee21e. And I've compared the latest commit upates content again with that no-problem commit to found that the only break changes between is the deps updates via commit 86ee9c6.

In my test, when I revert the deps updates via commit 86ee9c6 based on latest commit (i.e. the main branch), the problem is gone.

I've not test which dep's updating caused this issue, hope for founding and helping me how to fix this problem.

Thanks.

The text was updated successfully, but these errors were encountered:

Fannon · 2023-03-26T17:52:56Z

Hi @c0001,

thanks for the report and the investigation!
From what you write, I'm rather sure that the problem is actually coming from the https://github.com/leeoniya/uFuzzy library. There we already have a newer version. I could try a dependency update to that and if that works, I can create a new release. If not, could you consider creating your report at the uFuzzy GitHub issues?

There seems to be one issue already: leeoniya/uFuzzy#14

Could you also send me some text examples which cause the issue? Then I can use this to replicate.

Thanks!
Simon

c0001 · 2023-03-27T10:05:25Z

I've check the leeoniya/uFuzzy#14 issue which says ufuzzy may
need a more complex regexp specified for its option entry while used
for CJK terms based search, and below is my experimentation snippet:

const ufuzzy = require('@leeoniya/ufuzzy')

let opts = {
    unicode: false,
    interSplit: "(\p{Unified_Ideograph=yes})+"
};

let uf = new ufuzzy(opts)

const list = ['Javascript太棒了', 'Javascript素晴らしい', 'Javascript너무훌륭해훌륭해']

let strs = {cn: '太棒', jp: 'らし', kr: '훌륭'}

for (let i in strs) {
    console.log(i + ':' + uf.filter(list, strs[i]));
}

If I don't specify interSplit with CJK full stack range as above
(which I've taken from here which see Unified_Ideograph),
the search is failed as returning null even if I using the latest
version of uFuzzy. It seems uFuzzy can not use CJK chars for
calculating matches.

I just make a little bit try, hope that's useful.

Could you also send me some text examples which cause the issue?
Then I can use this to replicate.

The contents in above code example may be enough?

There we already have a newer version. I could try a dependency
update to that and if that works, I can create a new release

I've tested this lastest commit yet, still throwing same error.

If not, could you consider creating your report at the uFuzzy GitHub issues?

Yeah, by the investigation, I think that should be but may be
duplicated with leeoniya/uFuzzy#14, thus that may be later or
... any suggestion issued to?

Thanks!
Briant

Fannon · 2023-03-27T18:30:47Z

Hi Briant,

ok, I got it. So if I can fix this issue by just configuring the interSplit regexp, then I'll do that. I'm just not sure on the consequences on this.

But since you obviously can check out and build this project, I'll create a PR and can merge it - before making a release out of it. Then you and I can test a bit.

Fannon · 2023-03-27T19:09:39Z

Ok, merged it. For me it seemed to work fine - could you also try out and verify?

Issued by Fannon#87, we should provide this to let multilingual searching be capable, robust and following user side specifications via doing more tweaks of uFuzzy(@leeoniya/ufuzzy). See also: - leeoniya/uFuzzy#14 - https://github.com/leeoniya/uFuzzy#how-it-works

c0001 · 2023-03-28T07:59:57Z

(Sorry for above duplicated commit references caused by my four times rebase, but I don't know how to remove them)

@Fannon

Thanks, that's seems worked for me as well, but I think that's not a usual way since interSplit is not a exposed fixed user option based on uFuzzy's README, and the commentary of interSplit (here) just say that's a term segmentation & punct/whitespace merging and defaults to [^A-Za-z0-9']+ which let me thought of that the p{Unified_Ideograph}=yes regexp is considerred as a divider filter rather than a mactch? If so, the fix of this may not be proper.

Thus, I make a new pull-request to provide a ufuzzyOptions user side customization which let us can do more test and tricks, is that OK?

leeoniya · 2023-03-28T12:12:21Z

yes, interSplit should only be used for white-space and punctuation. the default setting excludes apostrophes so that don't does not get matched as don t.

https://stackoverflow.com/questions/4328500/how-can-i-strip-all-punctuation-from-a-string-in-javascript-using-regex

c0001 · 2023-03-28T16:45:45Z

Hi uFuzzy author @leeoniya :

Thanks for issuing the internal mechanism of option interSplit, and I've checked it with the regexp range covering whitespace and unicode punctuations, and worked for most of cases for example run bellow test.js:

const process = require('process')
const ufuzzy  = require('@leeoniya/ufuzzy')

let opts = {
    interSplit: "(^\\s|\\p{P})+",
};

let uf = new ufuzzy(opts)

const list = ['Javascript 太棒了', 'Javascript素晴らしい', 'Javascript너무훌륭해훌륭해']

let strs = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '太 棒', '素\t晴 ら', '훌”륭\t\t해 훌']

let c = 0
for (let i in strs) {
    let j = uf.filter(list, strs[i])
    if (!j) {
        console.log('\'' + strs[i] + '\'' + ': ' + j);
        c++;
    }
}
if (c!=0) {
    console.log("--> Failure of some letter search, Abort!")
    process.exit(1)
} else {
    console.log("--> All of searching available, Done!")
}

This given the result of

>$ node test.js 
--> All of searching available, Done!

But if I changed interSplit regexp to ([^a-zA-Z']|\\s|\\p{P})+ that plusses your defaults, the result is failure of all CJK patterns:

>$ node test_ufuzzy.js 
'太 棒': null
'素     晴 ら': null
'훌”륭          해 훌': null
--> Failure of some letter search, Abort!

That's why?

With checking uppon my previous post in this issue, if I change the interSplit regexp (\p{Unified_Ideograph=yes})+ to either [^a-zA-Z'\p{Unified_Ideograph=yes}]+ or ([^a-zA-Z']|\p{Unified_Ideograph=yes})+ for a compatibility test, the test is still failure with all of cn,jp and kr search result are null, even if I replace [^a-zA-Z] to purely simple "not" logical regexp char classes bracket (such as [^a]). Unless I remove the ^. It seems like the ^ keyword will also affect the result?

In the other hand, can you give more description of uFuzzy's designation of interSplit and either for its sibling intraSplit since I'm confused about what that is based on the understanding of interSplit which you've pointed out. Or return to the original issue, my question is that does uFuzzy can fully support fuzzy searching with extending Latin/Roman alphabet with CJK terms and any existent workaround can be released out here?

Thanks!
Briant

Fannon · 2023-03-28T18:06:16Z

Thanks @c0001 for working on further clarifying this!

So I'll revert my (then premature) fix in the master branch and wait until we found a good solution that works well for both Latin/Roman and CJK search items.

I do like the idea of just exposing uFuzzyOptions directly for power users. But I hope that we also find a more general, user friendly fix :)

c0001 · 2023-03-29T09:16:25Z

According to #91, We have flexible uFuzzyOptions user
customization place now, so the issue can be halt temporarily.

The problem is for those users are not familiar with uFuzzy
specification that how we make a guide for them to be involved in. Or
based on this discussion of #90 , a simple supportCJK may be
approachable for common user. To my wish, let user-end be with
non-feeling of this change (i.e. find a robust way to make both of
them workable internally) may be Best.

Further more for any one first time dating with this problem, this
previous post may be useful.

Thanks!
Briant

Fannon · 2023-03-29T17:35:16Z

@c0001 : If you have any further insights how to best configure uFuzzy for your case, feel free to propose a better README fix or mention. In case we find a solution which works general-purpose, we can either have it in by default or if it comes with drawbacks, then have it as a dedicated option (like enableCJK: true)

Shall we close this or keep it open? Your call.

c0001 · 2023-03-29T21:21:56Z

Of course I will, I've made some changes to that but hadn't got the best way even if following the leeoniya's suggestion to use fully punctuations and whitespace binding to interSplit since the new unterminated chars classes problem is found out. Uh...relax :)

Shall we close this or keep it open? Your call.

I've preferred to halt this issue as Close as not planned since there's no exact fix found here. I halt this now, if any problem is presenting on, whenever reopening is welcom.

Fannon · 2023-03-30T05:39:35Z

Ok, great. Thanks again!
I've now created a new release and submitted it to the stores. Usually this takes a few days.

Fannon mentioned this issue Mar 26, 2023

Update dependencies #88

Merged

Fannon mentioned this issue Mar 27, 2023

Adding interSplit option to fuzzy search #89

Merged

c0001 mentioned this issue Mar 28, 2023

Add uFuzzy user customizable option: ufuzzyOptions #90

Closed

Fannon mentioned this issue Mar 29, 2023

Allow passing custom uFuzzy options. #91

Merged

c0001 closed this as not planned Won't fix, can't repro, duplicate, stale Mar 29, 2023

Fannon mentioned this issue Apr 21, 2023

multiple languages support? #94

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined" #87

Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined" #87

c0001 commented Mar 23, 2023

Fannon commented Mar 26, 2023 •

edited

Loading

c0001 commented Mar 27, 2023 •

edited

Loading

Fannon commented Mar 27, 2023

Fannon commented Mar 27, 2023

c0001 commented Mar 28, 2023 •

edited

Loading

leeoniya commented Mar 28, 2023

c0001 commented Mar 28, 2023 •

edited

Loading

Fannon commented Mar 28, 2023 •

edited

Loading

c0001 commented Mar 29, 2023

Fannon commented Mar 29, 2023

c0001 commented Mar 29, 2023

Fannon commented Mar 30, 2023 •

edited

Loading

Deps updates since commit 86ee9c6 cause CJK characters *fuzzy* search throw error message of "undefined" #87

Deps updates since commit 86ee9c6 cause CJK characters *fuzzy* search throw error message of "undefined" #87

Comments

c0001 commented Mar 23, 2023

Fannon commented Mar 26, 2023 • edited Loading

c0001 commented Mar 27, 2023 • edited Loading

Fannon commented Mar 27, 2023

Fannon commented Mar 27, 2023

c0001 commented Mar 28, 2023 • edited Loading

leeoniya commented Mar 28, 2023

c0001 commented Mar 28, 2023 • edited Loading

Fannon commented Mar 28, 2023 • edited Loading

c0001 commented Mar 29, 2023

Fannon commented Mar 29, 2023

c0001 commented Mar 29, 2023

Fannon commented Mar 30, 2023 • edited Loading

Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined" #87

Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined" #87

Fannon commented Mar 26, 2023 •

edited

Loading

c0001 commented Mar 27, 2023 •

edited

Loading

c0001 commented Mar 28, 2023 •

edited

Loading

c0001 commented Mar 28, 2023 •

edited

Loading

Fannon commented Mar 28, 2023 •

edited

Loading

Fannon commented Mar 30, 2023 •

edited

Loading